Abstract

Proteins interact with other proteins within biological pathways, forming
connected subgraphs in the protein-protein interactome (PPI). However, proteins
are often involved in multiple biological pathways, which complicates inference
about interactions between proteins. Gene expression data is informative here
since genes within a particular pathway tend to have more correlated expression
patterns than genes from distinct pathways. We provide an algorithm that uses
gene expression information to remove inter-pathway protein-protein
interactions, thereby simplifying the structure of the protein-protein
interactome. This refined topology permits easier interpretation of multiple
biological pathways simultaneously.

1 Introduction

The protein-protein interactome (PPI) is a large graph where proteins are nodes
and edges between these nodes represent all known interactions between proteins.
In cases where proteins interact in order to drive a particular biological process,
the connected nodes of a PPI can represent an entire biological pathway. However,
inferring a biological pathway from the PPI is complicated by the fact that many
proteins are involved in multiple biological functions. Thus, a connected subgraph
of the PPI must be viewed as a mixture of smaller graphs that each represent a
particular pathway. It is the goal of this paper to refine the PPI by isolating these
smaller graph components, which are more likely to contain just a single pathway.



Our primary tool for this endeavor is gene expression data, which allows us to
identify pairs of genes with highly correlated expression patterns. In general, gene
pairs are more likely to have correlated expression if they belong to the same
biological pathway, which gives us a mechanism for refining the PPI to isolate
individual pathways. We introduce a procedure for reducing large connected
components of the PPI into smaller groups with higher connectivity that represent a
single pathway.



2 Methods 



The input for our procedure is a protein-protein interactome and a set of gene 
expression profiles. Our data sources are given in the supplementary materials, 
which can be downloaded at: 



http://stat.wharton.upenn.edu/~stjensen/research/ppi.html

Our algorithm focusses initially on proteins with the highest degree in the PPI as
potential multi-pathway proteins. Expression profiles are used to infer
interactions of this protein that span multiple pathways, and then the local topology of
the PPI is edited to emphasize connectivity within single pathways. The overall
framework of our algorithm is shown in Figure 1.

Figure 1: Overall framework of our PPI refining procedure.

Let $A$ be the protein that currently has the largest number of connections in the
PPI. We denote $\mathcal{N}(A)$ as a set containing all proteins connected to $A$ via
protein-protein interactions. For each pair of proteins $i$ and $j$ in $\mathcal{N}(A)$, we
calculate the correlation $\rho_{ij}$ of their gene expression patterns. An agglomerative
hierarchical clustering of the proteins in $\mathcal{N}(A)$ is performed using
$d_{ij} = 1 - |\rho_{ij}|$ as the distance metric. A pre-specified threshold
$\theta_{\mathrm{cor}}$ is used to convert this hierarchical clustering into a partition of
disjoint subsets $\mathcal{N}_1, \mathcal{N}_2, \ldots, \mathcal{N}_m$ of highly correlated
proteins, as well as an extra subset $\mathcal{N}_{m+1}$ containing unclustered proteins.
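
The clustering step can be illustrated with a minimal sketch in plain Python. It
is not the code distributed with the supplementary materials. The function names,
the toy expression profiles, and the choice of average linkage are assumptions
made here for illustration, since the text does not name a linkage rule. The
dendrogram is cut at distance 1 - theta_cor, and singleton groups are treated as
the unclustered subset.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cluster_neighbours(neighbours, expr, theta_cor):
    """Partition the neighbours of a hub protein by expression correlation.

    Uses d_ij = 1 - |rho_ij| and merges groups while the average-linkage
    distance (an assumption) stays at or below 1 - theta_cor.
    Returns (clusters, unclustered).
    """
    dist = {frozenset(pair): 1 - abs(pearson(expr[pair[0]], expr[pair[1]]))
            for pair in combinations(neighbours, 2)}
    groups = [[p] for p in neighbours]

    def linkage(g, h):
        total = sum(dist[frozenset((a, b))] for a in g for b in h)
        return total / (len(g) * len(h))

    cut = 1 - theta_cor
    while len(groups) > 1:
        # Find the closest pair of groups (i < j by construction).
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]))
        if linkage(groups[i], groups[j]) > cut:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    clusters = [g for g in groups if len(g) > 1]
    unclustered = [g[0] for g in groups if len(g) == 1]
    return clusters, unclustered

# Toy profiles (hypothetical values): P1, P2, P3 are strongly (anti-)correlated.
expr = {
    "P1": [1, 2, 3, 4],
    "P2": [2, 4, 6, 8.5],
    "P3": [4, 3, 2, 1],
    "P4": [1, 5, 2, 4],
}
clusters, unclustered = cluster_neighbours(["P1", "P2", "P3", "P4"], expr, 0.8)
```

Because the distance uses the absolute correlation, the anti-correlated profile
P3 joins the same subset as P1 and P2, while P4 remains unclustered.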

We proceed under the assumption that each highly correlated subset
$\mathcal{N}_1, \ldots, \mathcal{N}_m$ of $\mathcal{N}(A)$ is a group of proteins belonging to
the same pathway. We remove inter-pathway connections within $\mathcal{N}(A)$ by replacing
protein $A$ in the PPI with duplicates $A_1, A_2, \ldots, A_m$, where $A_i$ retains only
the connections between $A$ and proteins in $\mathcal{N}_i$. We also discard all
connections between protein $A$ and proteins contained in the unclustered set
$\mathcal{N}_{m+1}$.
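
As an illustration of the duplication step, the sketch below replaces a hub with
one copy per correlated subset. The adjacency representation (a dict mapping each
protein to a set of partners), the function name split_hub, and the duplicate
naming scheme "A_1", "A_2" are hypothetical choices made for this example only.

```python
def split_hub(ppi, hub, clusters):
    """Replace `hub` with one duplicate per cluster, editing `ppi` in place.

    `ppi` maps each protein to the set of its interaction partners
    (undirected, no self-loops). Edges from `hub` to partners that appear
    in no cluster are discarded.
    """
    # Detach the hub from every partner.
    for partner in ppi.pop(hub):
        ppi[partner].discard(hub)
    # Attach one duplicate per cluster (naming scheme is hypothetical).
    for k, cluster in enumerate(clusters, start=1):
        copy = f"{hub}_{k}"
        ppi[copy] = set(cluster)
        for partner in cluster:
            ppi[partner].add(copy)
    return ppi

# Toy network: A touches two groups {B, C} and {D, E}, plus unclustered F.
toy_ppi = {
    "A": {"B", "C", "D", "E", "F"},
    "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"A", "E"}, "E": {"A", "D"},
    "F": {"A"},
}
split_hub(toy_ppi, "A", [["B", "C"], ["D", "E"]])
```

After the call, the two groups are no longer linked through A, and F has lost its
only connection.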

The expression clustering and network reduction steps are repeated for all highly
connected proteins in the protein-protein interactome. We terminate the algorithm
when no protein in the refined PPI has more connections than a pre-specified
degree cutoff $\theta_{\mathrm{deg}}$.
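
The iteration can be sketched as a loop that repeatedly splits the
highest-degree protein. This is a hedged outline, not the distributed
implementation: refine_ppi, cluster_fn, and origin are hypothetical names, and
the clustering is passed in as a callback. One further assumption is made so that
the loop is guaranteed to stop: duplicates created by a split are not split
again, so the stopping rule here applies to proteins that have not yet been
processed.

```python
def refine_ppi(ppi, cluster_fn, theta_deg):
    """Split high-degree proteins, highest degree first.

    ppi        : dict mapping protein -> set of partners (undirected)
    cluster_fn : callable taking a sorted list of partners and returning a
                 list of clusters; partners in no cluster lose their edge
    Returns (refined, origin), where origin maps each duplicate name to the
    protein it was copied from.
    """
    ppi = {p: set(partners) for p, partners in ppi.items()}  # work on a copy
    origin = {}
    done = set()  # duplicates are never re-split (assumption for termination)
    while True:
        hubs = [p for p in ppi if p not in done and len(ppi[p]) > theta_deg]
        if not hubs:
            break
        hub = max(hubs, key=lambda p: (len(ppi[p]), p))  # ties broken by name
        partners = sorted(ppi.pop(hub))
        for q in partners:
            ppi[q].discard(hub)
        for k, cluster in enumerate(cluster_fn(partners), start=1):
            copy = f"{hub}_{k}"
            origin[copy] = hub
            done.add(copy)
            ppi[copy] = set(cluster)
            for q in cluster:
                ppi[q].add(copy)
    return ppi, origin

# Toy stand-in for expression clustering: group partners by a known label.
pathway = {"B": "x", "C": "x", "D": "y", "E": "y", "F": "z"}

def by_pathway(partners):
    groups = {}
    for p in partners:
        groups.setdefault(pathway[p], []).append(p)
    return [g for g in groups.values() if len(g) > 1]

network = {
    "A": {"B", "C", "D", "E", "F"},
    "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"A", "E"}, "E": {"A", "D"},
    "F": {"A"},
}
refined, origin = refine_ppi(network, by_pathway, theta_deg=2)
```

In a real run the callback would look up expression profiles, using origin to
find the profile of a duplicate that appears among the partners of a later hub.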



3 Evaluation using gene ontology 

We examine the effects of our algorithm using the gene ontology (GO) database
(The Gene Ontology Consortium, 2000), which is a multi-level collection of
biological terms that are assigned to specific genes. The GO database contains three
types of biological terms: cellular component, molecular function, and biological 
process. Molecular function is the most specific type, but many proteins either lack 
molecular function annotations or do not share a common annotation with other 
proteins. In contrast, most proteins are annotated with a cellular component GO 
term, but this feature is too broad to be particularly informative. We focus our 
analysis on biological process GO terms as the GO type most closely related to our 
goal of isolating biological pathways. 

For a group of connected proteins in the protein-protein interactome, we define an 
evaluation metric called the GO distance. The GO distance is the depth in the GO 
hierarchy of the deepest GO term that is common to all proteins in a connected 
group. The GO hierarchy becomes more specific as the depth increases, so large 
GO distances are indicative of a group of proteins that have high coherence in
terms of their biological processes.
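
A small sketch of the GO distance follows. The term names, the toy hierarchy, and
the function name go_distance are hypothetical. Two conventions are assumptions
of this sketch rather than statements from the text: depth is measured as the
longest path to the root (the GO hierarchy is a directed acyclic graph, so a term
can have several parents), and a group with no shared term is given distance 0.

```python
from functools import lru_cache

def go_distance(group, annotations, parents):
    """Depth of the deepest GO term shared by every protein in `group`.

    annotations : dict protein -> set of directly assigned terms
    parents     : dict term -> iterable of parent terms (root has none)
    A term counts as shared if it is assigned to, or is an ancestor of a
    term assigned to, each protein.
    """
    def with_ancestors(term):
        seen, stack = set(), [term]
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(parents.get(t, ()))
        return seen

    @lru_cache(maxsize=None)
    def depth(term):
        ps = parents.get(term, ())
        return 0 if not ps else 1 + max(depth(p) for p in ps)

    shared = None
    for protein in group:
        terms = set()
        for t in annotations.get(protein, ()):
            terms |= with_ancestors(t)
        shared = terms if shared is None else shared & terms
    if not shared:
        return 0  # no common term (assumed convention)
    return max(depth(t) for t in shared)

# Toy hierarchy with hypothetical term names.
parents = {
    "root": (),
    "metabolism": ("root",),
    "signaling": ("root",),
    "glycolysis": ("metabolism",),
    "gluconeogenesis": ("metabolism",),
}
annotations = {
    "P1": {"glycolysis"},
    "P2": {"glycolysis"},
    "P3": {"gluconeogenesis"},
    "P4": {"signaling"},
}
same_term = go_distance(["P1", "P2"], annotations, parents)
same_branch = go_distance(["P1", "P3"], annotations, parents)
root_only = go_distance(["P1", "P4"], annotations, parents)
```

The three example groups share a leaf term, a mid-level term, and only the root,
so their GO distances decrease in that order.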



We quantify the improvement in coherence of the PPI by comparing the
distributions of GO distances over all genes in the original and the refined versions of the
PPI. Specifically, we focus on the mean GO distances over all genes in the original
PPI compared to our refinement of the PPI. In Figure 2 we examine the refined
PPI for multiple versions of our procedure corresponding to different choices of
the absolute co-expression cutoff parameter $\theta_{\mathrm{cor}}$ and the degree cutoff
parameter $\theta_{\mathrm{deg}}$.

Figure 2: Mean GO Distance for different parameter cutoffs
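
One way to read "mean GO distance over all genes" is sketched below: each
protein is assigned the score of the connected component that contains it, and
the scores are averaged over proteins. This reading, the decision to skip
isolated proteins, and all names (connected_components, mean_group_score, the toy
score function) are assumptions of this sketch; the supplementary materials
should be consulted for the exact definition.

```python
from collections import deque

def connected_components(ppi):
    """Connected components of an undirected graph given as dict -> set."""
    seen, components = set(), []
    for start in ppi:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            p = queue.popleft()
            comp.append(p)
            for q in ppi[p]:
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
        components.append(comp)
    return components

def mean_group_score(ppi, score):
    """Average of score(component) over proteins; isolated proteins skipped."""
    total, count = 0, 0
    for comp in connected_components(ppi):
        if len(comp) < 2:
            continue
        total += score(comp) * len(comp)
        count += len(comp)
    return total / count if count else 0.0

# Toy stand-in for the GO distance: 3 if all members share a label, else 0.
label = {"A": "x", "A_1": "x", "A_2": "y",
         "B": "x", "C": "x", "D": "y", "E": "y"}

def toy_score(comp):
    return 3 if len({label[p] for p in comp}) == 1 else 0

original = {"A": {"B", "C", "D", "E"},
            "B": {"A"}, "C": {"A"}, "D": {"A"}, "E": {"A"}}
split = {"A_1": {"B", "C"}, "B": {"A_1"}, "C": {"A_1"},
         "A_2": {"D", "E"}, "D": {"A_2"}, "E": {"A_2"}}
mean_before = mean_group_score(original, toy_score)
mean_after = mean_group_score(split, toy_score)
```

In the toy example the single mixed component scores 0, while the two components
left after splitting each score 3.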



We see that for all choices of input parameters $\theta_{\mathrm{cor}}$ and
$\theta_{\mathrm{deg}}$, the refined PPI from
our procedure shows a dramatically larger mean GO distance than the original
PPI. This result demonstrates that the refined protein connections from our
procedure have a much greater coherence in their biological processes compared to
the original PPI. Although our procedure results in a PPI with greater biological
coherence, there is one sacrifice: some proteins are removed completely from the
refined PPI due to all of their connections to other proteins being removed. We
must balance our increase in biological coherence with the reduction in the
number of proteins contained in the PPI. In Figure 3 we give the number of proteins
in the original PPI as well as the number of proteins remaining in the PPI for each
version of our procedure given in Figure 2.

Figure 3: Protein count for different parameter cutoffs
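
The protein count can be sketched as the number of distinct proteins that still
have at least one connection. Counting the duplicates of a protein once, via a
hypothetical origin mapping from duplicate name to original protein, is an
assumption of this sketch, as is the function name proteins_retained.

```python
def proteins_retained(ppi, origin=None):
    """Number of distinct original proteins with at least one connection.

    ppi    : dict mapping protein -> set of partners
    origin : optional dict mapping duplicate name -> original protein
    """
    origin = origin or {}
    return len({origin.get(p, p) for p, partners in ppi.items() if partners})

# Toy example: F lost its only edge when hub A was split into A_1 and A_2.
before = {
    "A": {"B", "C", "D", "E", "F"},
    "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"A", "E"}, "E": {"A", "D"},
    "F": {"A"},
}
after = {
    "A_1": {"B", "C"}, "A_2": {"D", "E"},
    "B": {"C", "A_1"}, "C": {"B", "A_1"},
    "D": {"E", "A_2"}, "E": {"D", "A_2"},
    "F": set(),
}
n_before = proteins_retained(before)
n_after = proteins_retained(after, {"A_1": "A", "A_2": "A"})
```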

Not surprisingly, stricter choices of the threshold parameters $\theta_{\mathrm{cor}}$ and
$\theta_{\mathrm{deg}}$ lead to a PPI that has many proteins removed. Based on these results,
we suggest parameter values of $\theta_{\mathrm{cor}} = 0.8$ and $\theta_{\mathrm{deg}} = 4$ as a
good compromise that gives increased
biological coherence without the removal of too many proteins from the PPI. We
provide our refined PPI under these parameter settings along with code for
producing refined PPIs under other parameter settings at:

http://stat.wharton.upenn.edu/~stjensen/research/ppi.html

Our procedure depends not only on the input parameters $\theta_{\mathrm{cor}}$ and
$\theta_{\mathrm{deg}}$ but also on
the ordering of proteins in the iterative portion of the algorithm. We investigated
the results of using a random ordering in our iterative algorithm compared to the
default setting of processing proteins in order of highest degree first. Details are given
in our supplementary materials, but we found that our default choice of proteins
ordered by highest degree was superior to random orderings.

In summary, we have provided an iterative algorithm that uses gene expression
data to refine the protein-protein interactome. Our evaluation suggests that our
refined PPI has greater biological coherence than the original PPI, at least in terms
of the GO biological processes category.